Table of contents

  1. Introduction
    1.1. Description of our data
    1.2. Getting the data
    1.3. Preparing the data
  2. Graphical Analysis
    2.1 Map of earthquakes
    2.2 Plotting continuous variables
    2.3 Plotting categorical variables
  3. Linear Model
    3.1.Characteristics of a Linear Model
    3.2.Research Question
    3.3.Data Cleaning and Graphical Analysis
    3.4.Fitting the Linear Model
  4. Generalised Linear Model set to Poisson
    4.1.Characteristics of a GLM set to Poisson
    4.2.Research Question
    4.3.Fitting the Poisson GLM
    4.4.Model Interpretation and Evaluation
  5. Generalised Linear Model set to Binomial
    5.1.Characteristics of a GLM set to Binomial
    5.2.Research Question
    5.3.Fitting the Binomial GLM
    5.4.Model Interpretation and Evaluation
  6. Generalised Additive Model
    6.1. Characteristics of a GAM
    6.2. Research Question
    6.3. Fitting a GAM
    6.4. Model Interpretation and Evaluation
  7. Neural Network
  8. Support Vector Machine
  9. Optimisation Problem
  10. Conclusion



1. Introduction

Earthquakes are one of the most destructive natural disasters that can strike without warning, causing extensive damage to infrastructure, loss of life, and massive economic losses. While we cannot prevent earthquakes from occurring, the ability to accurately predict when and where they might occur could save countless lives and minimize the damage caused.

Therefore, our aim with this report is to contribute to the prediction of significant earthquakes. Such predictions make it possible to provide advance warning of potentially catastrophic seismic events, allowing governments and communities to prepare and take the necessary measures to minimize their impact.


1.1.Description of our data

Data source: https://www.ngdc.noaa.gov/hazel/view/hazards/earthquake/search

The Significant Earthquake Database contains information on destructive earthquakes from 2150 B.C. to the present that meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami. The database can also be displayed and extracted with the Natural Hazards Interactive Map.


Below we give a short summary of our main variables. For primary and secondary deaths and damages, the total numbers have been added to the dataset where available; in the “description” fields the values have already been grouped into classes.


Magnitude [Mag]

Valid values: 0.0 to 9.9

The value in this column contains the primary earthquake magnitude. Magnitude measures the energy released at the source of the earthquake. Magnitude is determined from measurements on seismographs. For pre-instrumental events, the magnitudes are derived from intensities. There are several different scales for measuring earthquake magnitudes. The primary magnitude is chosen from the available magnitude scales in this order:

Mw Magnitude
Ms Magnitude
Mb Magnitude
Ml Magnitude
Mfa Magnitude
Unknown Magnitude


Modified Mercalli Intensity [MMI.Int]

Valid values: 1 to 12

The Modified Mercalli Intensity (Int) is given in Roman Numerals (converted to numbers in the digital database). An interpretation of the values is listed below.

Table 1. Modified Mercalli Intensity Scale of 1931

  1. Not felt except by a very few under especially favorable circumstances.

  2. Felt only by a few persons at rest, especially on upper floors of buildings. Delicately suspended objects may swing.

  3. Felt quite noticeably indoors, especially on upper floors of buildings, but many people do not recognize it as an earthquake. Standing motor cars may rock slightly. Vibration like passing truck. Duration estimated.

  4. During the day felt indoors by many, outdoors by few. At night some awakened. Dishes, windows, and doors disturbed; walls make creaking sound. Sensation like heavy truck striking building. Standing motorcars rock noticeably.

  5. Felt by nearly everyone; many awakened. Some dishes, windows, etc., broken; a few instances of cracked plaster; unstable objects overturned. Disturbance of trees, poles, and other tall objects sometimes noticed. Pendulum clocks may stop.

  6. Felt by all; many frightened and run outdoors. Some heavy furniture moved; a few instances of fallen plaster or damaged chimneys. Damage slight.

  7. Everybody runs outdoors. Damage negligible in buildings of good design and construction; slight to moderate in well-built ordinary structures; considerable in poorly built or badly designed structures. Some chimneys broken. Noticed by persons driving motor cars.

  8. Damage slight in specially designed structures; considerable in ordinary substantial buildings, with partial collapse; great in poorly built structures. Panel walls thrown out of frame structures. Fall of chimneys, factory stacks, columns, monuments, walls. Heavy furniture overturned. Sand and mud ejected in small amounts. Changes in well water. Persons driving motor cars disturbed.

  9. Damage considerable in specially designed structures; well-designed frame structures thrown out of plumb; great in substantial buildings, with partial collapse. Buildings shifted off foundations. Ground cracked conspicuously. Underground pipes broken.

  10. Some well-built wooden structures destroyed; most masonry and frame structures destroyed with foundations; ground badly cracked. Rails bent. Landslides considerable from river banks and steep slopes. Shifted sand and mud. Water splashed over banks.

  11. Few, if any (masonry), structures remain standing. Bridges destroyed. Broad fissures in ground. Underground pipelines completely out of service. Earth slumps and land slips in soft ground. Rails bent greatly.

  12. Damage total. Waves seen on ground surfaces. Lines of sight and level distorted. Objects thrown upward into the air.


Focal Depth [Focal.Depth..km.]

The depth of the earthquake is given in kilometers.


Region [Region]

Regional boundaries are defined as follows:

150 - North America and Hawaii: (Canada, Mexico, USA)

100 - Central America: (Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama)

90 - Caribbean: (Antigua and Barbuda, Barbados, Cuba, Dominican Republic, French Guiana, Grenada, Guadeloupe, Haiti, Jamaica, Martinique, Puerto Rico, Saint Lucia, Saint Vincent and the Grenadines, Trinidad and Tobago, U.S. Virgin Islands)

160 - South America: (Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Peru, Venezuela)

70 - Atlantic Ocean

15 - Northern Africa: (Algeria, Egypt, Libya, Morocco, Sudan, Tunisia)

10 - Central, Western and S. Africa: (Burundi, Cameroon, Canary Islands, Central African Republic, Congo, Côte d'Ivoire, Ethiopia, Gabon, Ghana, Guinea, Guyana, Malawi, Mozambique, Rwanda, Sierra Leone, South Africa, Tanzania, Togo, Uganda, Zambia)

20 - Antarctica

120 - Northern and Western Europe: (Austria, Belgium, France, Germany, Iceland, Netherlands, Switzerland, United Kingdom)

130 - Southern Europe: (Azores (Portugal), Black Sea, Bosnia-Herzegovina, Croatia, Cyprus, Greece, Italy, Macedonia, Portugal, Serbia and Montenegro, Slovenia, Spain)

110 - Eastern Europe: (Bulgaria, Hungary, Poland, Romania, Slovakia, Ukraine)

140 - Middle East: (Iran, Iraq, Israel, Jordan, Lebanon, Saudi Arabia, Syria, Turkey, Yemen)

40 - Central Asia and Caucasus: (Afghanistan, Armenia, Azerbaijan, Black Sea, Western China, Georgia, Kazakhstan, Kyrgyzstan, Mongolia, Russia, Tajikistan, Turkmenistan, Uzbekistan)

30 - East Asia: (Eastern China, East China Sea, Japan, Japan Sea, North Korea, South Korea, Taiwan, Yellow Sea)

60 - S. and SE. Asia and Indian Ocean: (Bangladesh, Bhutan, India, Indian Ocean, Myanmar (Burma), Nepal, Pakistan, Sri Lanka, Thailand, Vietnam)

170 - Central and South Pacific: (Australia, Caroline Islands, Celebes Sea, Cook Islands, Fiji, French Polynesia, Guam, Indonesia, Kermadec Islands (New Zealand), Kiribati, Malaysia, Rep. of Marshall Islands, Fed. States of Micronesia, New Caledonia, New Zealand, Northern Mariana Islands, Pacific Ocean, Papua New Guinea, Philippines, Samoa, Solomon Islands, Solomon Sea, South China Sea, South Georgia and the South Sandwich Islands, Tasman Sea, Timor Sea, Tonga, Vanuatu)

80 - Bering Sea

50 - Kamchatka and Kuril Islands


Associated Tsunami or Seiche [Tsu]
When a tsunami or seiche was generated by an earthquake, an icon appears in the Associated Tsunami column, which is linked to the tsunami event database. The link displays additional tsunami event information.

Volcano [Vol]
The Volcano link displays additional information if the earthquake was associated with a volcanic eruption. This may include details such as the VEI index, the morphology, and the effects of the eruption.


Description of Deaths from the Earthquake [Death.Description]

Valid values: 0 to 4

When a description was found in the historical literature instead of an actual number of deaths, this value was coded and listed in the Deaths column. If the actual number of deaths was listed, a descriptor was also added for search purposes.

0 None
1 Few (~1 to 50 deaths)
2 Some (~51 to 100 deaths)
3 Many (~101 to 1000 deaths)
4 Very many (over 1000 deaths)

Description of Damage from the Earthquake [Damage.Description]

Valid values: 0 to 4

For those events not offering a monetary evaluation of damage, the following five-level scale was used to classify damage (1990 dollars) and was listed in the Damage column. If the actual dollar amount of damage was listed, a descriptor was also added for search purposes.

0 NONE
1 LIMITED (roughly corresponding to less than $1 million)
2 MODERATE (~$1 to $5 million)
3 SEVERE (~$5 to $25 million)
4 EXTREME (~$25 million or more)

[Missing.Description]
[Injuries.Description]
[Houses.Destroyed.Description]
[Houses.Damaged.Description]


Description of Deaths from the Earthquake and secondary effects (eg Tsunami)
Valid values: 0 to 4

When a description was found in the historical literature instead of an actual number of deaths, this value was coded and listed in the Deaths column. If the actual number of deaths was listed, a descriptor was also added for search purposes.

0 None
1 Few (~1 to 50 deaths)
2 Some (~51 to 100 deaths)
3 Many (~101 to 1000 deaths)
4 Very many (over 1000 deaths)

[Total.Death.Description]
[Total.Missing.Description]
[Total.Injuries.Description]
[Total.Damage...Mil.]
[Total.Houses.Destroyed.Description]
[Total.Houses.Damaged.Description]


1.2.Getting the data

We first set the working directory and then load the tab-separated data file into R.

#setwd("/Users/esinisik/Library/Mobile Documents/com~apple~CloudDocs/Uni/MAS/Sem2/Applied Machine Learning and Predictive Modelling 1/ML 1 Project")
eqdata <- read.csv("significant-earthquakes-database-country-region.tsv", header = TRUE, sep = "\t")

First, we have a look at the data provided:

str(eqdata)
## 'data.frame':    6367 obs. of  41 variables:
##  $ Search.Parameters                 : chr  "[]" "" "" "" ...
##  $ Year                              : int  NA -2150 -2000 -2000 -1610 -1566 -1450 -1365 -1250 -1050 ...
##  $ Mo                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dy                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Mn                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Sec                               : num  NA 0 NA NA NA 0 NA NA 0 0 ...
##  $ Tsu                               : int  NA NA 1 NA 3 NA NA 4 NA NA ...
##  $ Vol                               : int  NA NA NA NA 1351 NA NA NA NA NA ...
##  $ Country                           : chr  "" "JORDAN" "SYRIA" "TURKMENISTAN" ...
##  $ Region                            : int  NA 140 130 40 130 140 130 140 140 140 ...
##  $ Location.Name                     : chr  "" "JORDAN:  BAB-A-DARAA,AL-KARAK" "SYRIA:  UGARIT" "TURKMENISTAN:  W" ...
##  $ Latitude                          : num  NA 31.1 35.7 38 36.4 ...
##  $ Longitude                         : num  NA 35.5 35.8 58.2 25.4 35.3 25.5 35.8 35.5 35 ...
##  $ Focal.Depth..km.                  : int  NA NA NA 18 NA NA NA NA NA NA ...
##  $ Mag                               : num  NA 7.3 NA 7.1 NA NA NA NA 6.5 6.2 ...
##  $ MMI.Int                           : int  NA NA 10 10 NA 10 10 NA NA NA ...
##  $ Deaths                            : int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Death.Description                 : int  NA NA 3 1 NA NA NA NA NA NA ...
##  $ Missing                           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Missing.Description               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Injuries                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Injuries.Description              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Damage...Mil.                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Damage.Description                : int  NA 3 NA 1 NA 3 NA 3 3 3 ...
##  $ Houses.Destroyed                  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Houses.Destroyed.Description      : int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Houses.Damaged                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Houses.Damaged.Description        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Deaths                      : int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Total.Death.Description           : int  NA NA 3 1 3 NA NA NA NA NA ...
##  $ Total.Missing                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Missing.Description         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Injuries                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Injuries.Description        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Damage...Mil.               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Damage.Description          : int  NA NA NA 1 3 NA NA 3 NA NA ...
##  $ Total.Houses.Destroyed            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Houses.Destroyed.Description: int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Total.Houses.Damaged              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Houses.Damaged.Description  : int  NA NA NA NA NA NA NA NA NA NA ...

The original data set has 6367 observations of 41 variables.


1.3 Preparing the data

After loading the data, we decided to perform the following cleaning and enrichment steps.

Cleaning Process

Exclude data dated before 1900

  • Our reasons are the following:

    • the data are available mostly from historical records and are therefore less reliable

    • the measurement quality is lower because the methods were less developed. The modern seismometer was not invented until the mid-1800s, so it is reasonable to assume that this technology was not widely used around the world until the 1900s.

    • these records contain a large number of missing values

Exclude data where magnitude is not available

    • the variable of magnitude has key importance in our analysis, the records where it is missing are therefore too unreliable to consider in the analysis.

Exclude data where the number of deaths is not available

    • the death count is also a key variable within the scope of our analysis. Since there is no information available whether the NA can be treated as 0 or as true unknowns, the decision has been taken to exclude records without such value.

Clean column names for better handling

    • Some of the columns (e.g. “Focal.Depth..km.”) will be rewritten for better handling in the analysis.

Enrichment Process

Add magnitude column without decimals

    • At certain stages of the analysis, it can be beneficial to treat the magnitude as a whole number (a count).
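
The cleaning and enrichment steps above can be sketched as follows. This is a minimal illustration on a small made-up data frame, not the code used for this report. The helper name `clean_eq` is hypothetical, and using `floor()` to drop the decimals is an assumption (it is consistent with the `Mag` and `Mag.full` values in the output below).

```r
# Toy stand-in for the raw data (values are illustrative only)
eq_raw <- data.frame(
  Year   = c(1850, 1900, 1950, 2001, 2010),
  Mag    = c(7.1, 5.9, NA, 6.4, 7.7),
  Deaths = c(10, 140, 5, NA, 25)
)

# Hypothetical helper: applies the three exclusion rules and adds Mag.full
clean_eq <- function(df) {
  keep <- !is.na(df$Year) & df$Year >= 1900 &  # drop records before 1900
    !is.na(df$Mag) &                           # drop records without magnitude
    !is.na(df$Deaths)                          # drop records without death count
  out <- df[keep, , drop = FALSE]
  out$Mag.full <- floor(out$Mag)               # magnitude without decimals (assumed: floor)
  rownames(out) <- NULL
  out
}

eq_clean <- clean_eq(eq_raw)
print(eq_clean)
```

Only the records from 1900 and 2010 survive in this toy example, because the other rows are either too old or have a missing magnitude or death count.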


str(eqdata)
## 'data.frame':    1469 obs. of  44 variables:
##  $ Search.Parameters                 : chr  "" "" "" "" ...
##  $ Year                              : int  1900 1900 1901 1901 1901 1902 1902 1902 1902 1902 ...
##  $ Mo                                : int  7 10 3 3 8 1 2 3 4 7 ...
##  $ Dy                                : int  12 29 31 31 9 16 13 9 19 9 ...
##  $ Hr                                : int  6 9 7 7 18 NA 9 7 2 3 ...
##  $ Mn                                : int  25 11 11 12 33 NA 39 46 23 38 ...
##  $ Sec                               : num  0 0 NA NA 45 NA 30 0 30 0 ...
##  $ Tsu                               : int  NA 1276 NA 3725 1282 NA NA NA 5063 NA ...
##  $ Vol                               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Country                           : chr  "TURKEY" "VENEZUELA" "BULGARIA" "BULGARIA" ...
##  $ Region                            : int  140 90 110 110 30 150 40 140 100 140 ...
##  $ Location.Name                     : chr  "TURKEY:  KARS,KARAKURT,KAGIZMAN,DIGOR" "VENEZUELA:  MACUTO" "BULGARIA:  BALCHIK, KAVARNA, BLATNITSA, LIMANU" "BULGARIA:  BALCHIK" ...
##  $ Latitude                          : num  40.3 11 43.4 43.4 40.6 17.6 40.7 40.7 14 27.1 ...
##  $ Longitude                         : num  43.1 -66 28.7 28.5 142.3 ...
##  $ Focal.Depth..km.                  : int  NA NA NA NA 33 NA 15 NA 33 NA ...
##  $ Mag                               : num  5.9 7.7 6.4 7.2 8.2 7 6.9 5.5 7.5 6.3 ...
##  $ MMI.Int                           : int  8 10 8 10 NA NA 9 9 NA 8 ...
##  $ Deaths                            : int  140 25 4 4 18 2 86 4 2000 10 ...
##  $ Death.Description                 : int  3 1 1 1 1 1 2 1 4 1 ...
##  $ Missing                           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Missing.Description               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Injuries                          : int  NA NA 50 50 NA NA 60 NA NA NA ...
##  $ Injuries.Description              : int  NA NA 1 1 NA NA 2 NA NA NA ...
##  $ Damage...Mil.                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Damage.Description                : int  3 3 3 3 2 2 4 3 3 2 ...
##  $ Houses.Destroyed                  : int  1100 NA 1200 1200 8 NA 3496 NA NA NA ...
##  $ Houses.Destroyed.Description      : int  4 3 4 4 1 NA 4 NA NA NA ...
##  $ Houses.Damaged                    : int  900 NA 1200 NA NA NA 3496 NA NA NA ...
##  $ Houses.Damaged.Description        : int  3 3 4 NA NA NA 4 NA NA NA ...
##  $ Total.Deaths                      : int  140 25 4 4 18 2 86 NA 2000 10 ...
##  $ Total.Death.Description           : int  3 1 1 1 1 1 2 NA 4 1 ...
##  $ Total.Missing                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Missing.Description         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Injuries                    : int  NA NA 50 50 NA 2 60 NA NA NA ...
##  $ Total.Injuries.Description        : int  NA NA 1 1 NA 1 2 NA NA NA ...
##  $ Total.Damage...Mil.               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Damage.Description          : int  3 3 NA 3 2 2 4 NA 3 2 ...
##  $ Total.Houses.Destroyed            : int  1100 NA 1200 1200 8 NA 3496 NA NA NA ...
##  $ Total.Houses.Destroyed.Description: int  4 3 4 4 1 NA 4 NA NA NA ...
##  $ Total.Houses.Damaged              : int  900 NA NA NA NA NA NA NA NA NA ...
##  $ Total.Houses.Damaged.Description  : int  3 3 NA NA NA NA NA NA NA NA ...
##  $ Mag.full                          : num  5 7 6 7 8 7 6 5 7 6 ...
##  $ Freq.Region                       : int  246 24 12 12 242 89 104 246 44 246 ...
##  $ Freq.Country                      : int  104 13 4 4 63 53 4 104 15 138 ...

The original data set had 6367 observations of 41 variables. After cleaning and enrichment, 1469 observations of 44 variables remain; the three additional columns are Mag.full, Freq.Region and Freq.Country. We perform all of the following analysis with the transformed data set.


2. Graphical Analysis

2.1 Map of earthquakes

## [1] "sf"         "data.frame"


2.2 Plotting continuous variables

In the histograms below we plot the continuous variables of our data to see the distribution of the data points.

Remark:
- Year is treated as a continuous variable at first
- The outcomes of an earthquake in terms of deaths and damage are all counts; we therefore take the logarithm of these values in the plots below to make their distribution easier to see
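
A log-scale histogram of a count outcome, as described in the remark above, might look like the sketch below. The data are simulated, and the use of `log1p()` (that is, log(1 + x), which keeps zero counts finite) is an assumption; the report's own plots may use a plain logarithm.

```r
set.seed(1)
# Simulated stand-in for a count outcome such as Deaths (illustrative values only)
deaths <- rpois(200, lambda = exp(rnorm(200, mean = 3, sd = 1.5)))

# log1p(x) = log(1 + x); chosen here so that zero counts do not become -Inf (assumption)
log_deaths <- log1p(deaths)

pdf(NULL)  # discard graphics so the sketch also runs non-interactively
h <- hist(log_deaths, main = "log(1 + Deaths)", xlab = "log(1 + Deaths)")
dev.off()
```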



Above we can see that:
- Magnitude is approximately normally distributed
- Focal depth is right-skewed: most values are small, with a long tail towards large depths
- Year is left-skewed: most records come from recent decades

The outcomes of an earthquake are shown on a logarithmic scale. Here we also see close to normal distributions, except for the number of deaths, which is skewed with most values at the low end, and the number of missing people, which shows no clear pattern.


2.3 Plotting categorical variables


Let us visualize the categorical variables of our data.



Looking at the categorical variables:
- Intensity also has an approximately normal distribution, but we can see a large number of NA values
- The region and country variables are sorted in decreasing order of frequency

Region: The regions below occur most often in our database, which means they have the highest number of registered earthquakes:
- 140: Middle East
- 30: East Asia
- 160: South America
- 60: S. and SE. Asia and Indian Ocean
- 170: Central and South Pacific
- 130: Southern Europe
- 40: Central Asia and Caucasus
- 150: North America and Hawaii
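
A frequency ranking like the one above can be obtained by counting the region codes and sorting the counts in decreasing order. The sketch below uses a short made-up vector of codes in place of the `Region` column.

```r
# Illustrative region codes only; in the report these come from the Region column
region <- c(140, 30, 140, 160, 30, 140, 60, 170, 140, 30)

# Count earthquakes per region and sort from most to least frequent
region_counts <- sort(table(region), decreasing = TRUE)
print(region_counts)

pdf(NULL)  # discard graphics so the sketch also runs non-interactively
barplot(region_counts, xlab = "Region code", ylab = "Number of earthquakes")
dev.off()
```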

As we can see, the most frequently occurring countries also belong to the regions listed above.

Based on these charts and the world map of the data by latitude and longitude, we assume that location has an influence on the likelihood of an earthquake happening.



Looking at the categorical outcome variables we can see that:
- The variables Missing Description, Houses Damaged and Houses Destroyed have a relatively low number of data points.
- All variables except Death Description and Damage Description have a noticeable number of NA values.
- Death Description has no NA values and a rather right-skewed distribution.
- Damage Description has a small number of NA values, and its distribution is more balanced.


3. Linear Model


3.1 Characteristics of a Linear Model

Linear models are generally never completely correct, but their interpretability is relatively high compared to more complex models, and the danger of overfitting is generally lower. This is the reason we start our statistical analysis with a multiple linear regression below.

Linear regression models are supervised models: we want to predict how the dependent variable changes when the independent variables change. In a regression model the dependent variable is quantitative and continuous.


3.2 Research Question

Given that in linear regression the dependent variable should be continuous and numeric, we start our analysis with the magnitude of an earthquake as the dependent variable.

The magnitude of an earthquake indicates the energy released by the movement and is therefore an important characteristic of an earthquake. We consider the following independent variables from our data set, which may influence the magnitude of an earthquake:

  • location (longitude, latitude)
  • focal depth
  • time (Year)

The other variables (intensity, number of deaths, injuries and damages) are logically outcomes of an earthquake, and they are counts or coded categories. We will look at them later in our report with other, more suitable statistical models.


3.3 Data Cleaning and Graphical Analysis

All records have a magnitude value, since we already removed records with missing magnitude in the first stage of our data cleaning process.

Below we filter out all records where year, focal depth, longitude or latitude is missing and run our analysis on the following subset of the data:

## 'data.frame':    1314 obs. of  5 variables:
##  $ Year            : int  1901 1902 1902 1902 1902 1903 1904 1905 1905 1905 ...
##  $ Mag             : num  8.2 6.9 7.5 7.7 6.4 7.8 6.1 7.8 6.6 7.8 ...
##  $ Focal.Depth..km.: int  33 15 33 30 9 100 10 25 20 100 ...
##  $ Longitude       : num  142.3 48.6 -91 76.2 72.3 ...
##  $ Latitude        : num  40.6 40.7 14 39.9 40.8 ...

Our final data set eqdata.no.na.mag has 1314 rows.
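
One way to build such a subset is `complete.cases()`, which flags rows without any missing value in the selected columns. The sketch below uses a four-row toy data frame; the column names and the object name `eqdata.no.na.mag` follow the output above, everything else is illustrative.

```r
# Toy data frame with the five columns used in this chapter (illustrative values only)
eq <- data.frame(
  Year             = c(1901, 1902, 1902, 1903),
  Mag              = c(8.2, 6.9, 7.5, 7.8),
  Focal.Depth..km. = c(33, NA, 33, 100),
  Longitude        = c(142.3, 48.6, NA, 76.2),
  Latitude         = c(40.6, 40.7, 14.0, 39.9)
)

vars <- c("Year", "Mag", "Focal.Depth..km.", "Longitude", "Latitude")

# Keep only rows where none of the selected columns is NA
eqdata.no.na.mag <- eq[complete.cases(eq[, vars]), vars]
str(eqdata.no.na.mag)
```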

We first look at the distribution of the response variable, magnitude, with the histogram below. The highest frequency of values is between magnitude 6 and 7, and the frequencies decrease towards zero and towards 9.9. The histogram shows a slightly left-skewed distribution; however, the Q-Q plot shows that the distribution is very close to normal. We take this as a prerequisite assumption for our further investigation in this chapter with a multiple linear regression model.
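
A histogram and a normal Q-Q plot of the response can be produced with base R as sketched below. The magnitudes are simulated here, so the plots only illustrate the technique. The correlation between sample and theoretical quantiles is added as a rough numeric companion to the visual check and is not part of the original analysis.

```r
set.seed(42)
# Simulated magnitudes standing in for the Mag column (illustrative only)
mag <- round(rnorm(500, mean = 6.3, sd = 1), 1)

pdf(NULL)  # discard graphics so the sketch also runs non-interactively
par(mfrow = c(1, 2))
hist(mag, breaks = 20, main = "Histogram of magnitude", xlab = "Magnitude")
qqnorm(mag)
qqline(mag, col = "red")
dev.off()

# Values close to 1 mean the points lie close to a straight line in the Q-Q plot
qq <- qqnorm(mag, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```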



In the next step of our graphical analysis we look at the relationship between Magnitude and Latitude and Longitude:



Graph A: It is clearly visible that the number of data points increases between latitudes 30 and 60. This corresponds to the map shown before: this latitude range covers the most populated regions of the northern hemisphere, where the majority of the territories of North America, Europe and Asia are located. We can see a negative correlation: with each unit increase in latitude, the magnitude of the earthquakes seems to decrease. The smoother on this graph is close to a perfectly straight line. Therefore, there is no hint that the relationship between the response variable and the latitude predictor may be non-linear.


Graph B: On the plot with the longitude values we can see two groups. The first is located around the value -100, which corresponds to the west coast of the North American region. The larger group of data points is located between 0 and +150 longitude; these values correspond to the Eurasian continent. In both groups the number of data points is higher than elsewhere. However, here the regression line is almost parallel to the x-axis and very flat around a magnitude of 6. This may indicate a rather low correlation between magnitude and longitude. The line is close to a perfectly straight line. Therefore, there is no hint that the relationship between the response variable and the longitude predictor may be non-linear.


We can also look at the relationship between longitude, latitude and our dependent variable, magnitude, in an interactive three-dimensional plot. As expected, turning the graph towards the corresponding axes reproduces the plots of longitude and latitude against magnitude shown above. Turning the graph so that longitude is on the x-axis and latitude is on the y-axis, the distribution of the data points, not surprisingly, corresponds to the world map.



Let's continue our graphical analysis with the variable Focal Depth:



In the case of focal depth, the values have a high density between 0 and 100 km. A few values lie far out and may be high-leverage points, which means that changing or removing them influences our model more than other data points. The distribution of the data is therefore right-skewed.


Finally, we plot the variable Year both as a continuous variable (scatter plot) and as a categorical variable (box plot). As we saw before, most records of this variable come from recent decades, so its distribution is left-skewed.



Graph A: In the scatter plot above we have plotted the magnitude by year. It is clearly visible that there are fewer data points in the early years between 1900 and 1950. After 1975 the number of data points increases. We can also see more low values, including magnitudes below 3. This may be explained by the advancement of measuring technologies, or by the spread of these technologies throughout the different regions. As above, the smoother on this graph is close to a perfectly straight line. Therefore, there is no hint that the relationship between the response variable and the Year predictor may be non-linear.

Graph B: In the second plot we have converted the independent variable Year to a factor and treat it as a categorical variable. Here the increase in variance is clearly visible, as the whiskers of the box plots get longer towards the later years. The number of outliers in each direction (towards the minimum and the maximum) also increases over time. A possible explanation for this increase in variance is that measurement methods and equipment have improved over time, so that more sensitive equipment records more data points with lower magnitudes.

Collinearity

We now look at the correlation matrix to check whether there is collinearity between any of the variables above.



As the correlation matrix shows, there is no collinearity between the variables: most of the values are close to 0. Interestingly, we can see a weak correlation between longitude and focal depth, as we may have expected from our graphical analysis.
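
A collinearity check of this kind can be sketched with `cor()`. The predictors below are simulated independently of each other, so all off-diagonal correlations are small by construction; the numbers do not reproduce the matrix discussed above.

```r
set.seed(7)
n <- 300
# Simulated, independent predictors standing in for the real columns (illustrative only)
X <- data.frame(
  Year             = sample(1900:2020, n, replace = TRUE),
  Focal.Depth..km. = rexp(n, rate = 1 / 40),
  Latitude         = runif(n, -60, 70),
  Longitude        = runif(n, -180, 180)
)

# Pairwise Pearson correlations between the predictors
cm <- cor(X, use = "complete.obs")
print(round(cm, 2))

# Largest absolute off-diagonal correlation as a one-number summary
max_off_diag <- max(abs(cm[upper.tri(cm)]))
print(max_off_diag)
```

Values near 0 in the off-diagonal cells suggest that the predictors carry largely separate information; values near 1 or -1 would point to collinearity.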


3.4 Fitting the Linear Model

Let us fit the linear model with all variables: magnitude as the dependent variable, and longitude, latitude, focal depth and year as independent variables.

Before doing so we need to classify the variables:
- Magnitude, Latitude, Longitude, Focal Depth -> continuous
- Year: can be treated as a continuous variable or as a categorical variable with more than 2 levels
- To see whether Latitude and Longitude interact, we add an interaction term in the third model

To test a categorical variable with more than two levels, the drop1() function should be used: summary() only reports a separate t-test for each level of the factorized Year variable, not a single test for the variable as a whole.

Continuous variables can equivalently be tested with the drop1() function (i.e. via F-tests); for a continuous variable the results of a t-test and an F-test are identical.

Therefore we fit the following linear models and test their terms with the drop1() function:

  1. taking the year as a categorical (factorized) variable
  2. taking the year as a continuous variable
  3. adding an interaction between Latitude and Longitude to the 2nd model.
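
The three models listed above could be specified roughly as follows. This is a sketch on simulated data: the formulas follow the model formulas in the output below, while the object names (`lm.year.factor`, `lm.year.cont`, `lm.interaction`) and all data values are assumptions made for illustration. The last lines check the statement that, for a term with one degree of freedom, the squared t-statistic from `summary()` equals the F-statistic from `drop1()`.

```r
set.seed(123)
n <- 400
# Simulated data with the same column names as the report's subset (illustrative only)
d <- data.frame(
  Year             = sample(1990:1999, n, replace = TRUE),
  Focal.Depth..km. = rexp(n, rate = 1 / 40),
  Latitude         = runif(n, -60, 70),
  Longitude        = runif(n, -180, 180)
)
d$Mag <- 6.5 - 0.005 * d$Latitude + 0.002 * d$Focal.Depth..km. + rnorm(n, sd = 0.9)

# 1. Year as a categorical variable, 2. Year as continuous, 3. model 2 plus interaction
lm.year.factor <- lm(Mag ~ Focal.Depth..km. + Latitude + Longitude + factor(Year), data = d)
lm.year.cont   <- lm(Mag ~ Focal.Depth..km. + Latitude + Longitude + Year, data = d)
lm.interaction <- lm(Mag ~ Focal.Depth..km. + Year + Latitude * Longitude, data = d)

# F-tests for dropping each term; main effects inside an interaction are not tested
d1 <- drop1(lm.year.factor, test = "F")
d2 <- drop1(lm.year.cont, test = "F")
d3 <- drop1(lm.interaction, test = "F")
print(d3)

# For a one-degree-of-freedom term, t^2 from summary() equals F from drop1()
t_year <- summary(lm.year.cont)$coefficients["Year", "t value"]
f_year <- d2["Year", "F value"]
print(all.equal(t_year^2, f_year))
```

Note that `drop1()` respects marginality: in the third model it reports the interaction term but not the Latitude and Longitude main effects, which matches the third table of the output below.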

The drop1() output for the three models, in the order listed above, is:

## Single term deletions
## 
## Model:
## Mag ~ Focal.Depth..km. + Latitude + Longitude + factor(Year)
##                   Df Sum of Sq     RSS     AIC F value    Pr(>F)    
## <none>                          931.69 -205.79                      
## Focal.Depth..km.   1    29.149  960.84 -167.31 37.2620 1.395e-09 ***
## Latitude           1    60.941  992.63 -124.54 77.9028 < 2.2e-16 ***
## Longitude          1     6.288  937.98 -198.96  8.0381  0.004658 ** 
## factor(Year)     119   218.876 1150.57 -166.53  2.3512 6.330e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term deletions
## 
## Model:
## Mag ~ Focal.Depth..km. + Latitude + Longitude + Year
##                  Df Sum of Sq    RSS     AIC  F value    Pr(>F)    
## <none>                        1029.7 -310.35                       
## Focal.Depth..km.  1    34.365 1064.1 -269.21  43.6856 5.598e-11 ***
## Latitude          1    64.083 1093.8 -233.01  81.4643 < 2.2e-16 ***
## Longitude         1     7.477 1037.2 -302.84   9.5056  0.002091 ** 
## Year              1   120.852 1150.6 -166.53 153.6306 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term deletions
## 
## Model:
## Mag ~ Focal.Depth..km. + Year + Latitude + Longitude + (Latitude * 
##     Longitude)
##                    Df Sum of Sq    RSS     AIC  F value    Pr(>F)    
## <none>                          1027.4 -311.33                       
## Focal.Depth..km.    1    31.559 1058.9 -273.57  40.1797 3.182e-10 ***
## Year                1   118.555 1145.9 -169.83 150.9379 < 2.2e-16 ***
## Latitude:Longitude  1     2.337 1029.7 -310.35   2.9751   0.08479 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Our results show that all variables have a relevant effect on the response variable. Longitude has a weaker effect, which is explained by the very flat regression line between magnitude and longitude. The interaction between longitude and latitude is not significant at the 5% level (p = 0.085), so there is at most weak evidence for it in our model.

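Based on the AIC values, the first model (with factor(Year)) has a higher AIC (-205.79) than the second model (-310.35) and the third model (-311.33). In general, a lower AIC indicates a better trade-off between goodness of fit and model complexity. Although the first model has the smallest residual sum of squares (931.69), it spends 119 degrees of freedom on the Year variable, so by the AIC criterion the models with Year as a continuous variable are preferred, with the interaction model only marginally ahead of the second model.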

A more detailed analysis of the residuals of the linear model with the factorized Year variable shows the following:


Residuals vs Fitted Values

The red line is approximately linear; however, the majority of our data points, which lie between 5 and 7, deviate from the line. This may indicate some non-linearity in our data and may need further investigation and optimization.

Normal Q-Q

This chart shows how the residuals are distributed. If most data points lie on the line, the residuals are approximately normally distributed, which is the assumption we made at the beginning of our graphical analysis.

Scale Location

This chart is used to check for heteroskedasticity in our data. The red line is approximately horizontal, which means that the average magnitude of the standardized residuals isn’t changing much as a function of the fitted values. However, the data points are concentrated around fitted values between 5 and 7, while there are significantly fewer data points at lower and higher values, so the spread of the residuals is not uniform across the range. We read this as a sign of heteroskedasticity.

Residuals vs Leverage

In the graphical analysis we could see some data points lying “far” away from the rest, i.e. outliers. This chart shows which outliers are high-leverage points, meaning that changing or removing them influences our model more than the other data points do. We have a few such points, which lie outside the line indicating Cook's distance.
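The four diagnostic plots described above can be produced with a single call to plot() on a fitted lm object. The following is a small sketch on simulated data; the variable names are illustrative, and the 4 / n cut-off for Cook's distance is a common rule of thumb rather than a threshold used in this report.

```r
# Simulated data for illustration only
set.seed(2)
n <- 200
x <- runif(n, 0, 10)
y <- 5 + 0.1 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

# The four standard diagnostic plots in one 2 x 2 panel:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Numerical counterpart of the Residuals vs Leverage plot
cooks <- cooks.distance(fit)   # influence of each observation on the fit
lev   <- hatvalues(fit)        # leverage of each observation

# Rule of thumb (an assumption here): flag points with Cook's distance > 4 / n
influential <- which(cooks > 4 / n)
```

Inspecting `influential` gives the row indices that would be worth a closer look before deciding whether to keep or remove an observation.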

Conclusion

Fitting a multiple linear regression model gives a good overall picture. Based on our findings regarding our response variable, we see a need for further investigation with different, more complex statistical models. In the next chapter we will continue our analysis in more detail on the relationship between the magnitude and the regional differences.


4. Generalised Linear Model set to Poisson


4.1 Characteristics of a GLM set to Poisson Distribution

A Linear Model is a special case of a Generalised Linear Model (GLM): it uses the identity link function and assumes a normal (Gaussian) distribution of the response. A Poisson GLM, on the other hand, does not assume a normal distribution but a Poisson distribution, as visible in the plot on the left-hand side. The link function of a Poisson GLM is the natural log. It is therefore suitable for analysing count data, which does not follow a normal distribution.


(Figure placeholder: plot of a Gaussian and a Poisson distribution)
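The contrast between the two distributions can be sketched with base R as follows. The value lambda = 3 is an arbitrary choice for illustration; the Gaussian curve is given the same mean and variance so that the two panels are comparable.

```r
lambda <- 3                      # arbitrary rate, chosen for illustration
counts <- 0:12
pois.prob <- dpois(counts, lambda = lambda)

# Gaussian with the same mean and variance as the Poisson distribution
grid <- seq(-4, 12, length.out = 400)
norm.dens <- dnorm(grid, mean = lambda, sd = sqrt(lambda))

op <- par(mfrow = c(1, 2))
# Left: Poisson, discrete and defined on non-negative integers only
plot(counts, pois.prob, type = "h", lwd = 2,
     main = "Poisson (lambda = 3)", xlab = "Count", ylab = "Probability")
# Right: Gaussian, continuous and with probability mass below zero
plot(grid, norm.dens, type = "l",
     main = "Gaussian (same mean and variance)", xlab = "Value", ylab = "Density")
par(op)

# Share of the Gaussian distribution that falls below zero
below.zero <- pnorm(0, mean = lambda, sd = sqrt(lambda))
```

The quantity `below.zero` shows why a Gaussian model is a poor match for small counts: it assigns probability to negative values that a count can never take.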

A key assumption of a Poisson GLM is that the mean and the variance of the response are equal. In practice, however, this is often not the case. Therefore, the “quasipoisson” family, which estimates an additional dispersion parameter, is used in the following analysis.


4.2 Research Question

Another key requirement for fitting a Poisson or Quasipoisson GLM concerns the response variable of the model. The response can only be count data, for which this dataset offers plenty of candidate variables. However, the most sensible choice is presumably to analyze which variables could have a significant influence on the magnitude of an earthquake. Having created a magnitude column without decimals, we can fit it with a Quasipoisson GLM without losing its interpretability.


4.3 Fitting the Poisson GLM

Given the factual background of the variables available in the dataset, it would not make much sense to look for an influence on the magnitude in variables that describe the effects of an earthquake. This means that variables such as deaths, injuries, or material damages, which occur as after-effects of an earthquake, should not be considered for this model. Based on what is available in this dataset, variables that could explain the magnitude are, on the other hand, the focal depth of the epicenter, the region in which an earthquake occurred, and the specific country.


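The model described above can be written in this form:

glm.fit <- glm(Mag.full ~ Focal.Depth..km. + factor(Region) + factor(Country), family = "quasipoisson", data = eqdata)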


4.4 Model Interpretation and Evaluation

Fitting the presented GLM yields the following results:


## 
## Call:
## glm(formula = Mag.full ~ Focal.Depth..km. + factor(Region) + 
##     factor(Country), family = "quasipoisson", data = eqdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4214  -0.2295   0.0000   0.2070   1.2435  
## 
## Coefficients: (7 not defined because of singularities)
##                                                       Estimate Std. Error
## (Intercept)                                          1.7803666  0.1061740
## Focal.Depth..km.                                     0.0005295  0.0001131
## factor(Region)15                                    -0.1778126  0.1957260
## factor(Region)30                                    -0.1929452  0.1151957
## factor(Region)40                                    -0.0856511  0.1096526
## factor(Region)50                                     0.2330846  0.1521217
## factor(Region)60                                     0.1295872  0.2044565
## factor(Region)90                                     0.4633623  0.2804822
## factor(Region)100                                    0.2720398  0.2944492
## factor(Region)110                                    0.0023908  0.1838510
## factor(Region)120                                   -0.6185931  0.1404343
## factor(Region)130                                   -0.1714582  0.1957374
## factor(Region)140                                   -0.0495129  0.1074983
## factor(Region)150                                   -0.0458624  0.1096881
## factor(Region)160                                    0.3042599  0.2647040
## factor(Region)170                                    0.1855045  0.2062480
## factor(Country)ALBANIA                               0.1241380  0.1708598
## factor(Country)ALGERIA                               0.0105163  0.1686687
## factor(Country)ARGENTINA                            -0.2837274  0.2481127
## factor(Country)ARMENIA                               0.0898851  0.1097319
## factor(Country)AUSTRALIA                            -0.3617286  0.2415320
## factor(Country)AUSTRIA                               0.4423691  0.1883924
## factor(Country)AZERBAIJAN                           -0.0058779  0.0829789
## factor(Country)AZORES (PORTUGAL)                     0.0908080  0.1821738
## factor(Country)BANGLADESH                           -0.3083658  0.1896329
## factor(Country)BELGIUM                               0.4423691  0.1883924
## factor(Country)BHUTAN                               -0.1256079  0.2304098
## factor(Country)BOLIVIA                              -0.2844383  0.2557778
## factor(Country)BOSNIA-HERZEGOVINA                   -0.0114568  0.1801525
## factor(Country)BRAZIL                               -0.6235871  0.2631271
## factor(Country)BULGARIA                             -0.0854296  0.1866154
## factor(Country)BURUNDI                              -0.3993676  0.2122961
## factor(Country)CHILE                                -0.1627431  0.2432796
## factor(Country)CHINA                                 0.0949732  0.0416871
## factor(Country)COLOMBIA                             -0.3357137  0.2438615
## factor(Country)CONGO                                -0.0249045  0.1230058
## factor(Country)COSTA RICA                           -0.2768757  0.2754511
## factor(Country)CROATIA                               0.1013439  0.1785386
## factor(Country)CUBA                                 -0.4652078  0.2998726
## factor(Country)CYPRUS                                0.0878932  0.1983293
## factor(Country)DJIBOUTI                              0.0050384  0.1838535
## factor(Country)DOMINICAN REPUBLIC                   -0.4609784  0.2805145
## factor(Country)ECUADOR                              -0.3116361  0.2399532
## factor(Country)EGYPT                                 0.2317871  0.1848180
## factor(Country)EL SALVADOR                          -0.3158728  0.2787182
## factor(Country)ETHIOPIA                             -0.0066114  0.1501198
## factor(Country)FIJI                                 -0.1915863  0.2319116
## factor(Country)FRANCE                                0.2192220  0.1592212
## factor(Country)GEORGIA                               0.0039415  0.0832131
## factor(Country)GHANA                                 0.0045089  0.1838528
## factor(Country)GREECE                                0.1480140  0.1662053
## factor(Country)GUADELOUPE                           -0.4593829  0.2999465
## factor(Country)GUATEMALA                            -0.2945461  0.2751315
## factor(Country)GUINEA                                0.0055680  0.1838542
## factor(Country)HAITI                                -0.4982050  0.2684364
## factor(Country)HONDURAS                             -0.3361783  0.2794437
## factor(Country)INDIA                                -0.2467667  0.1769742
## factor(Country)INDONESIA                            -0.1401397  0.1755014
## factor(Country)IRAN                                 -0.0257931  0.0218813
## factor(Country)IRAQ                                 -0.1351837  0.1653179
## factor(Country)ISRAEL                                0.0434311  0.1510788
## factor(Country)ITALY                                 0.0603110  0.1663581
## factor(Country)JAMAICA                              -0.4599125  0.2999395
## factor(Country)JAPAN                                 0.2803887  0.0482289
## factor(Country)KAZAKHSTAN                            0.0796985  0.0908085
## factor(Country)KENYA                                -0.4115469  0.2122960
## factor(Country)KYRGYZSTAN                            0.1327375  0.0885863
## factor(Country)LIBYA                                 0.0042363  0.2325560
## factor(Country)MACEDONIA                            -0.0083058  0.1898881
## factor(Country)MADAGASCAR                           -0.3042227  0.2400305
## factor(Country)MALAWI                               -0.0535486  0.1386389
## factor(Country)MALAYSIA                             -0.3678367  0.2116401
## factor(Country)MARTINIQUE                           -0.3804259  0.2939640
## factor(Country)MEXICO                                0.1478613  0.0345247
## factor(Country)MONGOLIA                              0.3672515  0.1327971
## factor(Country)MONTENEGRO                            0.0886893  0.1983280
## factor(Country)MOROCCO                               0.1240780  0.1870667
## factor(Country)MOZAMBIQUE                            0.1597187  0.1748806
## factor(Country)MYANMAR (BURMA)                      -0.1368209  0.1813946
## factor(Country)NEPAL                                -0.0905729  0.1668025
## factor(Country)NETHERLANDS                           0.4365442  0.1884020
## factor(Country)NEW ZEALAND                          -0.1417089  0.1843567
## factor(Country)NICARAGUA                            -0.3554857  0.2856741
## factor(Country)PAKISTAN                             -0.2122963  0.1769209
## factor(Country)PANAMA                               -0.2752133  0.2944044
## factor(Country)PAPUA NEW GUINEA                     -0.0940749  0.1793433
## factor(Country)PERU                                 -0.3132049  0.2429827
## factor(Country)PHILIPPINES                          -0.1270827  0.1781497
## factor(Country)POLAND                                       NA         NA
## factor(Country)PORTUGAL                              0.3253520  0.2153165
## factor(Country)ROMANIA                              -0.0084114  0.1642379
## factor(Country)RUSSIA                               -0.0444317  0.0849062
## factor(Country)RWANDA                               -0.3195014  0.1472035
## factor(Country)SERBIA                               -0.0031797  0.1898810
## factor(Country)SLOVENIA                             -0.0037067  0.2325556
## factor(Country)SOLOMON ISLANDS                      -0.0102244  0.1870868
## factor(Country)SOUTH AFRICA                         -0.2824666  0.1225719
## factor(Country)SOUTH KOREA                           0.0151326  0.1704771
## factor(Country)SOUTH SUDAN                           0.1576005  0.1748781
## factor(Country)SPAIN                                        NA         NA
## factor(Country)SUDAN                                        NA         NA
## factor(Country)TAIWAN                                0.2310434  0.0508649
## factor(Country)TAJIKISTAN                            0.0163963  0.0509633
## factor(Country)TANZANIA                             -0.1831284  0.1291323
## factor(Country)THAILAND                             -0.1213716  0.2304576
## factor(Country)TONGA                                -0.0543806  0.2247658
## factor(Country)TRINIDAD AND TOBAGO                  -0.6618267  0.3071396
## factor(Country)TURKEY                                       NA         NA
## factor(Country)TURKMENISTAN                          0.2381294  0.0847854
## factor(Country)UGANDA                                       NA         NA
## factor(Country)UKRAINE                                      NA         NA
## factor(Country)USA                                          NA         NA
## factor(Country)USA TERRITORY                        -0.4572648  0.2999746
## factor(Country)UZBEKISTAN                            0.0893615  0.1097218
## factor(Country)VANUATU                              -0.0392907  0.2022383
## factor(Country)VENEZUELA                            -0.4205530  0.2510665
## factor(Country)WALLIS AND FUTUNA (FRENCH TERRITORY) -0.1794070  0.2320149
## factor(Country)YEMEN                                -0.3044883  0.2099830
##                                                     t value Pr(>|t|)    
## (Intercept)                                          16.768  < 2e-16 ***
## Focal.Depth..km.                                      4.680 3.19e-06 ***
## factor(Region)15                                     -0.908  0.36381    
## factor(Region)30                                     -1.675  0.09421 .  
## factor(Region)40                                     -0.781  0.43489    
## factor(Region)50                                      1.532  0.12573    
## factor(Region)60                                      0.634  0.52632    
## factor(Region)90                                      1.652  0.09879 .  
## factor(Region)100                                     0.924  0.35573    
## factor(Region)110                                     0.013  0.98963    
## factor(Region)120                                    -4.405 1.15e-05 ***
## factor(Region)130                                    -0.876  0.38123    
## factor(Region)140                                    -0.461  0.64517    
## factor(Region)150                                    -0.418  0.67594    
## factor(Region)160                                     1.149  0.25061    
## factor(Region)170                                     0.899  0.36861    
## factor(Country)ALBANIA                                0.727  0.46764    
## factor(Country)ALGERIA                                0.062  0.95030    
## factor(Country)ARGENTINA                             -1.144  0.25304    
## factor(Country)ARMENIA                                0.819  0.41287    
## factor(Country)AUSTRALIA                             -1.498  0.13449    
## factor(Country)AUSTRIA                                2.348  0.01903 *  
## factor(Country)AZERBAIJAN                            -0.071  0.94354    
## factor(Country)AZORES (PORTUGAL)                      0.498  0.61824    
## factor(Country)BANGLADESH                            -1.626  0.10419    
## factor(Country)BELGIUM                                2.348  0.01903 *  
## factor(Country)BHUTAN                                -0.545  0.58575    
## factor(Country)BOLIVIA                               -1.112  0.26634    
## factor(Country)BOSNIA-HERZEGOVINA                    -0.064  0.94930    
## factor(Country)BRAZIL                                -2.370  0.01795 *  
## factor(Country)BULGARIA                              -0.458  0.64719    
## factor(Country)BURUNDI                               -1.881  0.06019 .  
## factor(Country)CHILE                                 -0.669  0.50365    
## factor(Country)CHINA                                  2.278  0.02289 *  
## factor(Country)COLOMBIA                              -1.377  0.16887    
## factor(Country)CONGO                                 -0.202  0.83959    
## factor(Country)COSTA RICA                            -1.005  0.31502    
## factor(Country)CROATIA                                0.568  0.57039    
## factor(Country)CUBA                                  -1.551  0.12108    
## factor(Country)CYPRUS                                 0.443  0.65772    
## factor(Country)DJIBOUTI                               0.027  0.97814    
## factor(Country)DOMINICAN REPUBLIC                    -1.643  0.10058    
## factor(Country)ECUADOR                               -1.299  0.19428    
## factor(Country)EGYPT                                  1.254  0.21004    
## factor(Country)EL SALVADOR                           -1.133  0.25731    
## factor(Country)ETHIOPIA                              -0.044  0.96488    
## factor(Country)FIJI                                  -0.826  0.40890    
## factor(Country)FRANCE                                 1.377  0.16882    
## factor(Country)GEORGIA                                0.047  0.96223    
## factor(Country)GHANA                                  0.025  0.98044    
## factor(Country)GREECE                                 0.891  0.37335    
## factor(Country)GUADELOUPE                            -1.532  0.12590    
## factor(Country)GUATEMALA                             -1.071  0.28458    
## factor(Country)GUINEA                                 0.030  0.97584    
## factor(Country)HAITI                                 -1.856  0.06370 .  
## factor(Country)HONDURAS                              -1.203  0.22920    
## factor(Country)INDIA                                 -1.394  0.16346    
## factor(Country)INDONESIA                             -0.799  0.42473    
## factor(Country)IRAN                                  -1.179  0.23872    
## factor(Country)IRAQ                                  -0.818  0.41368    
## factor(Country)ISRAEL                                 0.287  0.77380    
## factor(Country)ITALY                                  0.363  0.71701    
## factor(Country)JAMAICA                               -1.533  0.12545    
## factor(Country)JAPAN                                  5.814 7.82e-09 ***
## factor(Country)KAZAKHSTAN                             0.878  0.38031    
## factor(Country)KENYA                                 -1.939  0.05279 .  
## factor(Country)KYRGYZSTAN                             1.498  0.13429    
## factor(Country)LIBYA                                  0.018  0.98547    
## factor(Country)MACEDONIA                             -0.044  0.96512    
## factor(Country)MADAGASCAR                            -1.267  0.20525    
## factor(Country)MALAWI                                -0.386  0.69938    
## factor(Country)MALAYSIA                              -1.738  0.08246 .  
## factor(Country)MARTINIQUE                            -1.294  0.19587    
## factor(Country)MEXICO                                 4.283 1.99e-05 ***
## factor(Country)MONGOLIA                               2.766  0.00577 ** 
## factor(Country)MONTENEGRO                             0.447  0.65482    
## factor(Country)MOROCCO                                0.663  0.50728    
## factor(Country)MOZAMBIQUE                             0.913  0.36127    
## factor(Country)MYANMAR (BURMA)                       -0.754  0.45083    
## factor(Country)NEPAL                                 -0.543  0.58723    
## factor(Country)NETHERLANDS                            2.317  0.02067 *  
## factor(Country)NEW ZEALAND                           -0.769  0.44224    
## factor(Country)NICARAGUA                             -1.244  0.21360    
## factor(Country)PAKISTAN                              -1.200  0.23039    
## factor(Country)PANAMA                                -0.935  0.35007    
## factor(Country)PAPUA NEW GUINEA                      -0.525  0.59999    
## factor(Country)PERU                                  -1.289  0.19765    
## factor(Country)PHILIPPINES                           -0.713  0.47577    
## factor(Country)POLAND                                    NA       NA    
## factor(Country)PORTUGAL                               1.511  0.13104    
## factor(Country)ROMANIA                               -0.051  0.95916    
## factor(Country)RUSSIA                                -0.523  0.60086    
## factor(Country)RWANDA                                -2.170  0.03017 *  
## factor(Country)SERBIA                                -0.017  0.98664    
## factor(Country)SLOVENIA                              -0.016  0.98729    
## factor(Country)SOLOMON ISLANDS                       -0.055  0.95643    
## factor(Country)SOUTH AFRICA                          -2.304  0.02136 *  
## factor(Country)SOUTH KOREA                            0.089  0.92928    
## factor(Country)SOUTH SUDAN                            0.901  0.36766    
## factor(Country)SPAIN                                     NA       NA    
## factor(Country)SUDAN                                     NA       NA    
## factor(Country)TAIWAN                                 4.542 6.12e-06 ***
## factor(Country)TAJIKISTAN                             0.322  0.74772    
## factor(Country)TANZANIA                              -1.418  0.15641    
## factor(Country)THAILAND                              -0.527  0.59853    
## factor(Country)TONGA                                 -0.242  0.80887    
## factor(Country)TRINIDAD AND TOBAGO                   -2.155  0.03137 *  
## factor(Country)TURKEY                                    NA       NA    
## factor(Country)TURKMENISTAN                           2.809  0.00506 ** 
## factor(Country)UGANDA                                    NA       NA    
## factor(Country)UKRAINE                                   NA       NA    
## factor(Country)USA                                       NA       NA    
## factor(Country)USA TERRITORY                         -1.524  0.12769    
## factor(Country)UZBEKISTAN                             0.814  0.41556    
## factor(Country)VANUATU                               -0.194  0.84599    
## factor(Country)VENEZUELA                             -1.675  0.09418 .  
## factor(Country)WALLIS AND FUTUNA (FRENCH TERRITORY)  -0.773  0.43952    
## factor(Country)YEMEN                                 -1.450  0.14730    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.1352037)
## 
##     Null deviance: 238.52  on 1313  degrees of freedom
## Residual deviance: 166.92  on 1203  degrees of freedom
##   (155 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
## Coefficient for Focal Depth: 1.00053
## Coefficient for Region 120: 0.5387018
## Coefficient for Country Japan: 1.323644


Focal Depth: As expected, the focal depth of an earthquake has a statistically significant influence on its magnitude. The model indicates that increasing the focal depth by 1 km results in a magnitude that is higher by about 0.05%.

plot(eqdata$Focal.Depth..km., eqdata$Mag, main = "Magnitude - Focal Depth", xlab = "Focal Depth in km", ylab = "Magnitude")


Nevertheless, plotting focal depth against magnitude reveals that most earthquakes, including strong ones, happen at a low depth. The very slow increase in magnitude with increasing depth of the epicenter could therefore be traced back to the few, but strong, earthquakes that happened at a very high focal depth.

Region 120:

Region 120 is also highly significant. The exponentiated coefficient (0.539) shows that earthquakes in Region 120 have on average around 46.1% lower magnitudes than in the reference region, with the other variables held constant. Recalling that Region 120 represents Northern and Western Europe, this outcome seems very plausible, as Europe is not known for earthquakes with a strong magnitude.

Country Japan:

Looking at the individual countries, Japan, Mexico, Mongolia, Taiwan, and Turkmenistan in particular show statistically significant effects (at the 1% level) on the magnitude of an earthquake. This agrees with common knowledge: Japan, for example, is known for its high earthquake magnitudes. The model estimates that earthquakes in Japan have on average a 32.4% higher magnitude than in the reference country.
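Because the model uses a log link, each coefficient has to be exponentiated before it can be read as a multiplicative effect. The short calculation below reproduces that step for the three estimates copied from the summary above; the object names are illustrative.

```r
# Estimates copied from the model summary (log scale)
coefs <- c(
  "Focal.Depth..km."     = 0.0005295,
  "factor(Region)120"    = -0.6185931,
  "factor(Country)JAPAN" = 0.2803887
)

ratio      <- exp(coefs)         # multiplicative effect on the expected magnitude
pct.change <- (ratio - 1) * 100  # the same effect as a percentage change

print(round(data.frame(estimate = coefs, ratio = ratio, pct.change = pct.change), 4))
```

A ratio above 1 is an increase and a ratio below 1 is a decrease relative to the reference level, so the percentage change is always the ratio minus one, not the ratio itself.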

Evaluation

The summary shows that the dispersion parameter (0.135) is well below 1. This indicates underdispersion: the variance of the response is smaller than its mean, rather than equal to it as a pure Poisson model would assume.
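The dispersion parameter reported for a quasipoisson model is the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below illustrates this on simulated, rounded values that behave similarly to a magnitude without decimals; all names and numbers in it are made up for illustration.

```r
# Simulated, tightly clustered integer response (illustration only)
set.seed(3)
n <- 500
depth <- runif(n, 0, 300)
mag.full <- round(rnorm(n, mean = 6 + 0.001 * depth, sd = 0.8))

fit.qp <- glm(mag.full ~ depth, family = quasipoisson)

# Pearson estimate of the dispersion parameter
phi.manual  <- sum(residuals(fit.qp, type = "pearson")^2) / df.residual(fit.qp)
# Value reported by summary()
phi.summary <- summary(fit.qp)$dispersion

c(manual = phi.manual, summary = phi.summary)
```

A value close to 1 would support the Poisson assumption, a value above 1 indicates overdispersion, and a value below 1 indicates underdispersion.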

plot(glm.fit, which = 2)


The Q-Q plot of the model shows no systematic deviations, which speaks for a moderate fit. However, summarizing the analysis of this model, the residual deviance (166.92) and its degrees of freedom (1203) differ greatly, which is problematic. Therefore, in order to fit a model of earthquake magnitudes that shows high accuracy, more extensive data about the factors that influence the magnitude should be considered. Further research has shown that these variables could be the intrinsic quality (coefficient of friction in the rock), the rupture area (whether the epicenter is in a subduction zone), the average displacement across the rupture area, and the directivity (energy release in the direction of movement).


5. Generalised Linear Model set to Binomial


5.1. Characteristics of a GLM set to Binomial


5.2. Research Question


5.3. Fitting the Binomial GLM


5.4. Model Interpretation and Evaluation


6. Generalised Additive Model


6.1. Characteristics of a GAM

A Generalised Additive Model (GAM) is an extension of smoothing splines that makes it possible to fit models containing several predictors simultaneously. The advantages of a GAM are that it can model non-linear relationships, multiple predictors and interaction effects; furthermore, a GAM is a robust model for handling outliers.

Difference Generalized Linear Model (GLM) vs. Generalized Additive Model (GAM)

A GAM does not assume a priori any specific form of the relationship between a covariate and the dependent variable, and can therefore be used to reveal and estimate non-linear effects of the covariate. GAMs assume that the relationship between the response variable and the predictors is additive, meaning that the effect of each predictor is independent of the others.

This allows for more flexibility in modeling complex relationships without explicitly specifying interactions. GAMs can accommodate both continuous and categorical predictor variables. Categorical variables are typically represented by dummy variables or factor levels.
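The difference between a linear term and a smooth term can be illustrated with the mgcv package. This is a sketch on simulated data with a deliberately non-linear relationship; the variable and model names are illustrative.

```r
library(mgcv)

# Simulated data with a non-linear relationship between x and y
set.seed(4)
n <- 300
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

fit.lin <- gam(y ~ x)      # linear term only, equivalent to a linear model
fit.gam <- gam(y ~ s(x))   # smooth term, its shape is estimated from the data

# Effective degrees of freedom of the smooth: values well above 1
# indicate a clearly non-linear effect
edf.smooth <- summary(fit.gam)$edf

# Lower AIC indicates the better trade-off between fit and complexity
aic <- c(linear = AIC(fit.lin), smooth = AIC(fit.gam))
print(aic)
```

Here the smooth term is expected to win clearly, because no straight line can follow a sine curve; on real data the gap can be much smaller.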

6.2. Research Question

Given the flexibility of the GAM, in this section we fit a GAM to see how magnitude, intensity, focal depth and regional differences influence the number of deaths caused by an earthquake. With our model we aim to be able to make predictions for the future.

Assessing the predictive performance of a model: Cross validation

In the previous chapters we carried out an explanatory analysis of the variables influencing the magnitude of an earthquake. In this chapter we fit a model to predict the expected total number of deaths based on Magnitude, Intensity, Focal Depth, and Region or Country.

We use an out-of-sample method in this chapter and therefore split the data into a training and a test part. This will enable us to assess and compare the predictive performance of our models at the end of this chapter.

As the NA values caused issues with the prediction functions, we removed them by subsetting the data set. We also split the dataset in a balanced way, to make sure we include each level of all factorized variables, such as Country, Region and MMI.Int (Intensity).
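One possible way to implement such a preparation step in base R is sketched below. The data are simulated, the 80/20 split ratio is an assumption (the ratio is not stated in this report), and the object names are hypothetical. The sampling is done within each Region level so that every level appears in the training set.

```r
# Simulated stand-in: column names follow the report, values are made up
set.seed(5)
eq_sim <- data.frame(
  Deaths = rpois(200, 20),
  Mag    = round(runif(200, 4, 9), 1),
  Region = factor(sample(c(30, 40, 60, 140), 200, replace = TRUE))
)
eq_sim$Mag[sample(200, 15)] <- NA   # inject some missing values

# 1. Keep complete rows only, so predict() does not fail on NA values
eq_complete <- eq_sim[complete.cases(eq_sim), ]

# 2. Sample 80% of the rows within each Region level (assumed ratio)
rows.by.region <- split(seq_len(nrow(eq_complete)), eq_complete$Region)
train.idx <- unlist(lapply(rows.by.region, function(idx) {
  idx[sample.int(length(idx), ceiling(0.8 * length(idx)))]
}))

train <- eq_complete[train.idx, ]
test  <- eq_complete[-train.idx, ]
```

With several factor variables, the same idea can be applied to their combination, for example by splitting on `interaction(Region, Country, drop = TRUE)`, although very rare combinations may then end up in the training set only.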

6.3. Fitting a GAM

In our data set we have 2 variables indicating the number of deaths caused by an earthquake: one with the counts and one with categorical levels calculated from these counts.

Fitting 4 Models with Countries and Regions:
- see results of model in drop down button
- checking with gam.check() function: 4 plots, which indicate bad fit see in drop-down button
- visualizing smoothed variables with vis.gam() function: indicates an overfit
(see below)

1. Response Variable: Fitting GAM Models with Death.Description Variable (categorical / 4 levels)

We will use the family multinom() given we have a categorical response variable.

Remarks to optimization:
- In GAM Model 1: Magnitude, Focal Depth and Latitude had an edf value of 1, therefore we optimized our model by adding them as simple linear terms. (This will be visible in the 3-D plots below.)
- In GAM Model 2: Magnitude had an edf value of 1, therefore we optimized our model by adding it as a simple linear term. (This will be visible in the 3-D plots below.)
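The optimization step described in these remarks can be sketched as follows with mgcv, using simulated data in which one predictor acts linearly and the other non-linearly. The variable names, the simulated effects and the choice of method = "REML" are assumptions made for this illustration.

```r
library(mgcv)

# Simulated data: linear effect of mag, non-linear effect of lon
set.seed(6)
n <- 300
mag <- runif(n, 4, 9)
lon <- runif(n, -180, 180)
y <- 0.4 * mag + sin(lon / 40) + rnorm(n, sd = 0.3)

# Step 1: fit every continuous predictor as a smooth term
fit.smooth <- gam(y ~ s(mag) + s(lon), method = "REML")

# Step 2: inspect the effective degrees of freedom (edf) per smooth
edf <- summary(fit.smooth)$s.table[, "edf"]
print(round(edf, 2))

# Step 3: a smooth with edf close to 1 is effectively a straight line,
# so that predictor is entered as an ordinary linear term instead
fit.mixed <- gam(y ~ mag + s(lon), method = "REML")
print(summary(fit.mixed)$p.table)
```

After the refit, the linear predictor appears in the table of parametric coefficients with a slope and a t-test, which is easier to interpret than a smooth that is a straight line anyway.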

Interpretation:

The summary output of both models below indicates that there is strong evidence that Magnitude has a linear effect on the response variable. There is no evidence that the values of Intensity, Latitude or Focal Depth have an influence on the number of deaths caused by an earthquake.

Based on the R-squared value, the models explain more than 50% of the overall variability.

There is no evidence that a region has a strong effect on the death number levels. We can observe that a few of the country levels seem to have a medium or minor effect on the increase of the death number levels. These countries are those where the reported earthquakes were more likely to have a high magnitude, a high frequency and a high intensity level.


## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Death.Description ~ Mag + factor(MMI.Int) + Focal.Depth..km. + 
##     Country + s(Longitude) + Latitude
## 
## Parametric coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.031812   1.492430   0.691  0.49003    
## Mag                        0.398088   0.086252   4.615 6.51e-06 ***
## factor(MMI.Int)3          -0.672688   0.999219  -0.673  0.50148    
## factor(MMI.Int)4          -0.675281   0.940576  -0.718  0.47352    
## factor(MMI.Int)5          -0.479768   0.924403  -0.519  0.60426    
## factor(MMI.Int)6          -0.154338   0.891187  -0.173  0.86266    
## factor(MMI.Int)7          -0.465643   0.892666  -0.522  0.60243    
## factor(MMI.Int)8          -0.181451   0.892043  -0.203  0.83899    
## factor(MMI.Int)9           0.399538   0.909090   0.439  0.66072    
## factor(MMI.Int)10          0.996484   0.923527   1.079  0.28171    
## factor(MMI.Int)11          0.987616   0.954350   1.035  0.30182    
## factor(MMI.Int)12          1.340165   1.056533   1.268  0.20591    
## Focal.Depth..km.          -0.003028   0.001977  -1.532  0.12694    
## CountryALBANIA            -0.629901   1.002893  -0.628  0.53057    
## CountryALGERIA             0.796286   1.246676   0.639  0.52363    
## CountryARGENTINA          -1.893165   3.123693  -0.606  0.54507    
## CountryAZERBAIJAN         -0.358355   0.988727  -0.362  0.71735    
## CountryAZORES (PORTUGAL)   0.196407   2.028262   0.097  0.92294    
## CountryBOSNIA-HERZEGOVINA -0.245785   1.007522  -0.244  0.80749    
## CountryBULGARIA           -0.003861   1.119740  -0.003  0.99725    
## CountryCHILE              -2.853241   3.120194  -0.914  0.36144    
## CountryCHINA              -0.872799   0.538045  -1.622  0.10613    
## CountryCOLOMBIA           -1.235650   3.055190  -0.404  0.68626    
## CountryCONGO              -1.214185   1.282760  -0.947  0.34486    
## CountryCOSTA RICA         -2.187120   3.247008  -0.674  0.50125    
## CountryCROATIA            -0.329285   0.992263  -0.332  0.74030    
## CountryECUADOR            -1.870289   3.145004  -0.595  0.55264    
## CountryEGYPT              -1.573974   1.061829  -1.482  0.13962    
## CountryEL SALVADOR        -0.917165   3.335518  -0.275  0.78359    
## CountryGEORGIA             0.186082   0.998660   0.186  0.85235    
## CountryGREECE             -0.943995   0.831738  -1.135  0.25757    
## CountryGUATEMALA          -1.382891   3.344752  -0.413  0.67966    
## CountryHAITI               0.504578   3.076219   0.164  0.86985    
## CountryINDIA              -0.554313   0.585270  -0.947  0.34457    
## CountryINDONESIA          -1.718347   0.815651  -2.107  0.03622 *  
## CountryIRAN               -0.209938   0.447720  -0.469  0.63958    
## CountryITALY               0.537361   0.970464   0.554  0.58031    
## CountryJAPAN              -1.346873   0.671230  -2.007  0.04596 *  
## CountryMEXICO             -2.163609   3.469830  -0.624  0.53354    
## CountryMYANMAR (BURMA)    -1.689785   0.989843  -1.707  0.08914 .  
## CountryNEPAL              -0.431113   0.559794  -0.770  0.44201    
## CountryNEW ZEALAND        -2.769077   1.850259  -1.497  0.13587    
## CountryNICARAGUA          -2.006962   3.329538  -0.603  0.54725    
## CountryPAKISTAN           -1.140381   0.535840  -2.128  0.03438 *  
## CountryPAPUA NEW GUINEA   -2.463001   1.057238  -2.330  0.02069 *  
## CountryPERU               -2.162138   3.115810  -0.694  0.48843    
## CountryPHILIPPINES        -1.963620   0.711392  -2.760  0.00624 ** 
## CountryROMANIA            -0.410322   1.116929  -0.367  0.71368    
## CountryRUSSIA             -1.053414   0.655739  -1.606  0.10954    
## CountrySOLOMON ISLANDS    -2.898192   1.444043  -2.007  0.04592 *  
## CountrySOUTH AFRICA       -3.257048   1.597455  -2.039  0.04260 *  
## CountryTAIWAN             -0.797377   0.727407  -1.096  0.27414    
## CountryTAJIKISTAN         -1.477827   0.732638  -2.017  0.04484 *  
## CountryTURKEY             -0.122656   0.648330  -0.189  0.85011    
## CountryUSA                -2.523815   3.802130  -0.664  0.50749    
## CountryVENEZUELA          -0.687079   2.818780  -0.244  0.80764    
## Latitude                  -0.029778   0.014543  -2.048  0.04174 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value
## s(Longitude) 3.031  3.929 0.688    0.56
## 
## R-sq.(adj) =  0.497   Deviance explained = 59.9%
## GCV = 0.86781  Scale est. = 0.68879   n = 291

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 4.561216e-07 .
## The Hessian was positive definite.
## Model rank =  66 / 66 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(Longitude) 9.00 3.03    1.06    0.84


## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Death.Description ~ Mag + factor(MMI.Int) + s(Focal.Depth..km.) + 
##     factor(Region) + s(Longitude) + s(Latitude)
## 
## Parametric coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.46745    1.12683  -1.302  0.19397    
## Mag                0.30012    0.07765   3.865  0.00014 ***
## factor(MMI.Int)3  -0.58284    0.96677  -0.603  0.54712    
## factor(MMI.Int)4  -0.55504    0.93252  -0.595  0.55223    
## factor(MMI.Int)5  -0.27552    0.89365  -0.308  0.75810    
## factor(MMI.Int)6   0.04286    0.87791   0.049  0.96110    
## factor(MMI.Int)7  -0.08073    0.87286  -0.092  0.92638    
## factor(MMI.Int)8   0.09976    0.87277   0.114  0.90909    
## factor(MMI.Int)9   0.61792    0.88518   0.698  0.48576    
## factor(MMI.Int)10  1.43992    0.89799   1.603  0.11004    
## factor(MMI.Int)11  1.49275    0.92191   1.619  0.10662    
## factor(MMI.Int)12  1.86848    1.03843   1.799  0.07313 .  
## factor(Region)15   1.36908    0.78148   1.752  0.08097 .  
## factor(Region)30   1.40580    0.86625   1.623  0.10584    
## factor(Region)40   1.56807    0.78868   1.988  0.04784 *  
## factor(Region)50   2.20897    1.14115   1.936  0.05399 .  
## factor(Region)60   1.35719    0.74614   1.819  0.07007 .  
## factor(Region)90   1.87391    0.99095   1.891  0.05974 .  
## factor(Region)100  0.15802    0.96643   0.164  0.87025    
## factor(Region)110  1.28117    0.94832   1.351  0.17788    
## factor(Region)130  0.94343    0.71903   1.312  0.19065    
## factor(Region)140  1.62289    0.71811   2.260  0.02466 *  
## factor(Region)150  0.09944    1.02996   0.097  0.92316    
## factor(Region)160  0.09758    0.83183   0.117  0.90671    
## factor(Region)170  0.79852    0.82434   0.969  0.33361    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value  
## s(Focal.Depth..km.) 2.007  2.476 1.953  0.1064  
## s(Longitude)        2.173  2.823 0.465  0.6183  
## s(Latitude)         2.710  3.364 3.558  0.0118 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.483   Deviance explained = 53.8%
## GCV = 0.79455  Scale est. = 0.70747   n = 291

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 1.031068e-06 .
## The Hessian was positive definite.
## Model rank =  52 / 52 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                       k'  edf k-index p-value
## s(Focal.Depth..km.) 9.00 2.01    1.15    1.00
## s(Longitude)        9.00 2.17    0.99    0.44
## s(Latitude)         9.00 2.71    1.08    0.88


2. Response Variable: Fitting GAM Models with Death (count variable)

Because our response variable is count data, we use family = "quasipoisson", which fits the model with a log link function.
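
As a rough sketch of what this family choice means in practice, the example below fits a quasi-Poisson GAM to simulated counts. The data frame `d`, the variables `mag` and `deaths`, and the simulation settings are assumptions for illustration only and are not taken from our data set.

```r
library(mgcv)

set.seed(2)
n <- 200
d <- data.frame(mag = runif(n, 4, 8))
# Simulated overdispersed counts whose log-mean rises with magnitude
mu <- exp(-2 + 0.8 * d$mag)
d$deaths <- rnbinom(n, mu = mu, size = 2)

# Quasi-Poisson GAM: log link, with the scale (dispersion) estimated from the data
m_count <- gam(deaths ~ s(mag), family = quasipoisson(link = "log"), data = d)

# Scale estimate; values well above 1 indicate overdispersion
scale_est <- summary(m_count)$scale
print(scale_est)

# Predictions on the link scale are log counts;
# type = "response" back-transforms them to expected counts
new <- data.frame(mag = c(5, 6, 7))
pred_link <- predict(m_count, newdata = new, type = "link")
pred_resp <- predict(m_count, newdata = new, type = "response")
print(cbind(new, log_count = pred_link, expected_count = pred_resp))
```

Because of the log link, the coefficients and smooths act multiplicatively on the expected count, which is why the estimates in the summaries are on the log scale.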

Interpretation:

With the death count as the response variable, both models explain a much higher share of the variability (deviance explained above 90%). However, the summary outputs indicate strong evidence of an effect for almost all regions in the region model, and for most country levels and the highest intensity levels (9 to 12) in the country model.

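Looking at the 3-dimensional plots of our linear and smoothed variables, the surfaces in most of the plots are strongly non-linear and very wiggly, which indicates overfitting. The edf values of the smooth terms point the same way: most are above 8, close to the basis dimension k' = 9.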
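
One possible way to draw such a 3-dimensional surface is `vis.gam()` from `mgcv`. The sketch below uses simulated data, so the data frame `d` and the variables `x1`, `x2` and `y` are assumptions, and it is not necessarily the exact call behind our plots.

```r
library(mgcv)

set.seed(5)
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + sin(2 * pi * d$x2) + rnorm(n, sd = 0.3)

# One linear term (x1) and one smooth term (x2)
m <- gam(y ~ x1 + s(x2), data = d)

# Null graphics device so the sketch also runs non-interactively;
# drop this line (and dev.off) in an interactive session to see the plot
pdf(NULL)

# Perspective plot of the fitted surface over x1 and x2:
# the surface is straight along x1 and curved along x2
vis.gam(m, view = c("x1", "x2"), plot.type = "persp", theta = 40, phi = 25)

invisible(dev.off())
```

A surface that bends sharply in many directions, as in our count models, suggests that the smooths are following noise rather than a stable pattern.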


## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## Deaths ~ s(Mag) + factor(MMI.Int) + s(Focal.Depth..km.) + s(Longitude) + 
##     s(Latitude) + Country
## 
## Parametric coefficients:
##                            Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)               -12.66294    1.27059   -9.966  < 2e-16 ***
## factor(MMI.Int)3           -2.00628    1.31444   -1.526 0.128470    
## factor(MMI.Int)4           -0.58942    1.22018   -0.483 0.629568    
## factor(MMI.Int)5           -1.43387    1.20798   -1.187 0.236605    
## factor(MMI.Int)6           -0.58815    1.19437   -0.492 0.622940    
## factor(MMI.Int)7           -0.03741    1.19501   -0.031 0.975053    
## factor(MMI.Int)8            1.59952    1.19484    1.339 0.182155    
## factor(MMI.Int)9            3.63301    1.19498    3.040 0.002673 ** 
## factor(MMI.Int)10           4.54556    1.19494    3.804 0.000188 ***
## factor(MMI.Int)11           5.44309    1.19494    4.555 8.99e-06 ***
## factor(MMI.Int)12           5.35756    1.19503    4.483 1.22e-05 ***
## CountryALBANIA             -5.61488    0.10914  -51.445  < 2e-16 ***
## CountryALGERIA            -14.55564    0.16396  -88.774  < 2e-16 ***
## CountryARGENTINA           27.92018    1.65953   16.824  < 2e-16 ***
## CountryAZERBAIJAN          -0.80170    1.20760   -0.664 0.507517    
## CountryAZORES (PORTUGAL)    6.73441    1.37483    4.898 1.96e-06 ***
## CountryBOSNIA-HERZEGOVINA  -7.83084    0.25785  -30.370  < 2e-16 ***
## CountryBULGARIA             0.55315    0.69284    0.798 0.425573    
## CountryCHILE               24.92980    1.69268   14.728  < 2e-16 ***
## CountryCHINA                2.64922    0.07020   37.738  < 2e-16 ***
## CountryCOLOMBIA            64.11328    1.52312   42.093  < 2e-16 ***
## CountryCONGO               -8.19934    0.87804   -9.338  < 2e-16 ***
## CountryCOSTA RICA          68.82480    1.67224   41.157  < 2e-16 ***
## CountryCROATIA             -8.80365    0.30998  -28.400  < 2e-16 ***
## CountryECUADOR             60.35377    1.53870   39.224  < 2e-16 ***
## CountryEGYPT               -3.80736    0.36760  -10.357  < 2e-16 ***
## CountryEL SALVADOR         75.60591    1.57738   47.931  < 2e-16 ***
## CountryGEORGIA              1.31918    0.10313   12.792  < 2e-16 ***
## CountryGREECE             -10.78537    0.12034  -89.624  < 2e-16 ***
## CountryGUATEMALA           74.03828    1.57500   47.008  < 2e-16 ***
## CountryHAITI               69.76501    1.50763   46.275  < 2e-16 ***
## CountryINDIA               -1.73780    0.06024  -28.850  < 2e-16 ***
## CountryINDONESIA           -3.90418    0.13979  -27.929  < 2e-16 ***
## CountryIRAN                -2.10823    0.06356  -33.169  < 2e-16 ***
## CountryITALY               -6.62961    0.08950  -74.070  < 2e-16 ***
## CountryJAPAN               -0.64251    0.10568   -6.080 5.80e-09 ***
## CountryMEXICO              65.81186    1.58644   41.484  < 2e-16 ***
## CountryMYANMAR (BURMA)      2.09023    0.63073    3.314 0.001088 ** 
## CountryNEPAL                2.31579    0.06581   35.187  < 2e-16 ***
## CountryNEW ZEALAND        -62.97676    0.69175  -91.040  < 2e-16 ***
## CountryNICARAGUA           74.01394    1.78201   41.534  < 2e-16 ***
## CountryPAKISTAN            -9.12737    0.09006 -101.349  < 2e-16 ***
## CountryPAPUA NEW GUINEA    -8.29659    0.32773  -25.316  < 2e-16 ***
## CountryPERU                61.23470    1.54150   39.724  < 2e-16 ***
## CountryPHILIPPINES          0.86989    0.11189    7.774 3.64e-13 ***
## CountryROMANIA             -4.13944    0.84801   -4.881 2.12e-06 ***
## CountryRUSSIA               3.51887    0.09399   37.440  < 2e-16 ***
## CountrySOLOMON ISLANDS    -10.48487    0.37697  -27.813  < 2e-16 ***
## CountrySOUTH AFRICA       -50.35219    0.62915  -80.032  < 2e-16 ***
## CountryTAIWAN               0.21030    0.07733    2.720 0.007098 ** 
## CountryTAJIKISTAN           0.74628    0.22823    3.270 0.001262 ** 
## CountryTURKEY              -0.01639    0.07137   -0.230 0.818558    
## CountryUSA                 61.82973    1.61359   38.318  < 2e-16 ***
## CountryVENEZUELA           64.51919    1.40786   45.828  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## s(Mag)              8.733  8.953 13257  <2e-16 ***
## s(Focal.Depth..km.) 6.281  7.157  3497  <2e-16 ***
## s(Longitude)        8.533  8.896  4095  <2e-16 ***
## s(Latitude)         8.882  8.986  9049  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.98   Deviance explained = 95.4%
## GCV = 1377.4  Scale est. = 1.4226    n = 291

## 
## Method: GCV   Optimizer: outer newton
## full convergence after 10 iterations.
## Gradient range [-2.572647e-06,8.852332e-05]
## (score 1377.38 & scale 1.422571).
## Hessian positive definite, eigenvalue range [1.110256,5.916919].
## Model rank =  90 / 90 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                       k'  edf k-index p-value
## s(Mag)              9.00 8.73    0.95    0.20
## s(Focal.Depth..km.) 9.00 6.28    1.02    0.58
## s(Longitude)        9.00 8.53    0.97    0.30
## s(Latitude)         9.00 8.88    1.02    0.60


## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## Deaths ~ s(Mag) + factor(MMI.Int) + s(Focal.Depth..km.) + s(Longitude) + 
##     s(Latitude) + factor(Region)
## 
## Parametric coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -7.1105     5.9592  -1.193    0.234    
## factor(MMI.Int)3   -5.5975     6.2750  -0.892    0.373    
## factor(MMI.Int)4   -3.1414     5.8805  -0.534    0.594    
## factor(MMI.Int)5   -2.7801     5.7740  -0.481    0.631    
## factor(MMI.Int)6   -4.0615     5.7311  -0.709    0.479    
## factor(MMI.Int)7   -2.8676     5.7287  -0.501    0.617    
## factor(MMI.Int)8   -1.0134     5.7279  -0.177    0.860    
## factor(MMI.Int)9    0.7434     5.7282   0.130    0.897    
## factor(MMI.Int)10   0.6122     5.7281   0.107    0.915    
## factor(MMI.Int)11   1.0044     5.7287   0.175    0.861    
## factor(MMI.Int)12   2.8529     5.7284   0.498    0.619    
## factor(Region)15   -1.8347     1.5664  -1.171    0.243    
## factor(Region)30   10.6973     1.5697   6.815 8.01e-11 ***
## factor(Region)40    9.8124     1.5817   6.204 2.50e-09 ***
## factor(Region)50   20.0607     1.6052  12.497  < 2e-16 ***
## factor(Region)60    9.3533     1.5626   5.986 8.09e-09 ***
## factor(Region)90   18.7203     2.2318   8.388 4.79e-15 ***
## factor(Region)100  19.0388     2.2623   8.416 4.00e-15 ***
## factor(Region)110   7.2905     3.0327   2.404    0.017 *  
## factor(Region)130   6.3323     1.5504   4.084 6.09e-05 ***
## factor(Region)140  12.1598     1.5561   7.814 1.90e-13 ***
## factor(Region)150  10.5611     2.2218   4.753 3.51e-06 ***
## factor(Region)160  14.7653     2.2454   6.576 3.15e-10 ***
## factor(Region)170   9.4728     1.5707   6.031 6.35e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## s(Mag)              8.833  8.976 784.7  <2e-16 ***
## s(Focal.Depth..km.) 7.972  8.472 190.5  <2e-16 ***
## s(Longitude)        8.748  8.970 390.2  <2e-16 ***
## s(Latitude)         8.704  8.962 291.7  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.966   Deviance explained = 93.3%
## GCV = 1565.4  Scale est. = 32.68     n = 291

## 
## Method: GCV   Optimizer: outer newton
## full convergence after 9 iterations.
## Gradient range [1.882031e-06,0.0002423294]
## (score 1565.438 & scale 32.67988).
## Hessian positive definite, eigenvalue range [0.7592256,2.602717].
## Model rank =  60 / 60 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                       k'  edf k-index p-value    
## s(Mag)              9.00 8.83    0.94    0.18    
## s(Focal.Depth..km.) 9.00 7.97    0.96    0.24    
## s(Longitude)        9.00 8.75    0.82  <2e-16 ***
## s(Latitude)         9.00 8.70    1.01    0.56    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


6.4. Model Interpretation and Crossvalidation

Cross-validation results for each of the four GAM models, reported as the mean of the R-squared values:

## [1] "1. GAM Model: Death.Description response variable / Country:"
## [1] 0.02795743
## [1] "2. GAM Model: Death.Description response variable / Region:"
## [1] 0.03282622
## [1] "3. GAM Model: Deaths response variable / Country:"
## [1] 0.006905893
## [1] "4. GAM Model: Deaths response variable / Region:"
## [1] 0.01355729
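
A mean cross-validated R-squared of this kind could be computed roughly as follows. The sketch uses simulated data and 5 folds. The number of folds, the data frame `d`, the variables `x` and `y`, and the definition of R-squared as 1 - SSE/SST on the held-out fold are all assumptions for the example and may differ from the procedure behind the numbers above.

```r
library(mgcv)

set.seed(3)
n <- 300
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.5)

k <- 5
# Randomly assign every row to one of k folds of (almost) equal size
fold <- sample(rep(seq_len(k), length.out = n))

r2 <- sapply(seq_len(k), function(i) {
  train <- d[fold != i, ]
  test  <- d[fold == i, ]
  fit   <- gam(y ~ s(x), data = train)      # fit on k - 1 folds
  pred  <- predict(fit, newdata = test)     # predict the held-out fold
  # Out-of-sample R-squared on the held-out fold
  1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)
})

mean_r2 <- mean(r2)
print(mean_r2)
```

Unlike the R-squared in the model summaries, this value is computed only on observations the model has not seen, so it can be far lower when a model overfits.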


Given these results, we conclude that GAM Model 2, with the Death.Description response variable and the Region factor levels, performs best in our validation, with the highest mean cross-validated R-squared (0.033). All four values are below 0.04, however, so the out-of-sample predictive power of every model is low. Based on this model, we have strong evidence that Magnitude affects the level of death numbers, some evidence for location (Latitude, p = 0.012; no evidence for Longitude, p = 0.62), and only weak evidence for the highest Intensity level (p = 0.07).


7. Neural Network


8. Support Vector Machine

## 
## Call:
## svm(formula = Type ~ Mag + Year, data = train, kernel = "radial", 
##     cost = 10, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  455
## 
##  ( 246 202 5 2 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  Tsunami Volcano Both Neither


9. Optimisation Problem


Assessing the predictive performance of a model: cross-validation

In-sample or out-of-sample performance:

- In-sample: we use all of the data to fit the model and evaluate it on the same data.
- Out-of-sample: we split the data in order to cross-validate our model's predictive performance on data it has not seen.

A random splitting procedure gives a more reliable estimate of predictive performance and helps to guard against overfitting.
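
The difference between in-sample and out-of-sample performance can be illustrated with a random split. The sketch below uses simulated data and base R only. The 70/30 split, the data frame `d`, the helper `rmse()` and the deliberately over-flexible degree-10 polynomial are assumptions chosen for the example.

```r
set.seed(4)
n <- 100
d <- data.frame(x = runif(n))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.5)

# Random 70/30 split into training and test rows
idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- d[idx, ]
test  <- d[-idx, ]

# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# A deliberately over-flexible model for a truly linear relationship
fit <- lm(y ~ poly(x, 10), data = train)

in_sample  <- rmse(train$y, fitted(fit))                  # same data as used for fitting
out_sample <- rmse(test$y, predict(fit, newdata = test))  # unseen data

print(c(in_sample = in_sample, out_sample = out_sample))
```

If the out-of-sample error is clearly larger than the in-sample error, the model has adapted to noise in the training data, which is the overfitting that the random split is meant to reveal.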


10. Conclusion